Project: Investigate TMDb Movie Dataset¶

                                                                           Prepared by: Nouf AlGhmadi

Table of Contents¶

  • Introduction
  • Data Wrangling
  • Exploratory Data Analysis
  • Conclusions
  • Limitations

1. Introduction¶

About TMDB Website:
TMDB is a community-built movie and TV database, specialised around movie, TV and actor data, the wesite currently have 4.4 millions (4,419,249) titles, 810,747 movies, and 139,637 tv shows in its database, as well as 83 million registered users. 1

The Movie Database (TMDB)
This Dataset is from Kaggle website, it contains 21 variables for 10866 movies (10571 unique movies), for 55 years, beginning in 1960 and ending in 2015. Thousands of actors and actresses are listed, as well as 2399 unique director names.
Additionally, the data includes vote_average, which indicates the average ratings the movie received, the length of the movie in minutes, the release date of the movie, as well as its budget and profit.

Data Sources:
Original data on Kaggle


Investigation Questions¶

  1. What is the Financial Performance of Movies Over Time?
  2. What is the Financial Performance (Budget and Profit) of the Top 10 Most Rating Movies?
  3. Which Movies Genres Companies Mostly Invest In?
  4. Which Movies Genres are the Most Protofable?
  5. How can the vote count effect the profits of a movie?
In [1]:
# Packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import plotly.express as px

%matplotlib inline

2. Data Wrangling¶

2.1 Load Data¶

In [2]:
# Load the data and print out a few lines using head() method.

PATH_URL = "https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csv"

df_url = pd.read_csv (PATH_URL)
df_url.head(3) #print the first 3 records
Out[2]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08

3 rows × 21 columns

2.2 Data Inspection¶

In [3]:
print('Shape of the data is ',df_url.shape) # return a tuple with the lengths of the corresponding array dimensions.
Shape of the data is  (10866, 21)

Note:
This dataset has 10866 Observations, and 21 Variables.

In [4]:
df_url.columns #print all the indicators
Out[4]:
Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')

Data Dictionary¶

  1. id: This is infact the movie_id as in the first dataset.
  2. imdb_id: A unique identifier for each movie.
  3. popularity: A numeric quantity specifying the movie popularity.
  4. budget: The budget in which the movie was made.
  5. revenue: The worldwide revenue generated by the movie.
  6. original_title: The title of the movie before translation or adaptation.
  7. cast: cast of the movie.
  8. homepage: A link to the homepage of the movie.
  9. director: director of the movie.
  10. tagline: Movie's tagline.
  11. keywords: The keywords or tags related to the movie.
  12. overview: A brief description of the movie.
  13. runtime: The running time of the movie in minutes.
  14. genre: The genre of the movie, Action, Comedy ,Thriller etc.
  15. production_companies: The production house of the movie.
  16. release_date: The date on which it was released.
  17. vote_count: the count of votes recieved.
  18. vote_average: average ratings the movie recieved.
  19. release_year: The year in which it was released.
  20. budget_adj
  21. revenue_adj
In [5]:
# Perform operations to inspect data types and look for instances of missing 
#    or possibly errant data.

df_url.info() # prints concise summary of the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

Note:
From the data types we see that all features are object/int64/float64. we can see that release_date is an object type and we will try parsing the date using pandas.to_datetime function so it will be more useful.

In [6]:
# check how many NaN values and in which columns and converted to Ratio
percent_missing = round(df_url.isnull().sum() * 100 / len(df_url),2)
print("Variables Missing Ratio:")
print(percent_missing)
Variables Missing Ratio:
id                       0.00
imdb_id                  0.09
popularity               0.00
budget                   0.00
revenue                  0.00
original_title           0.00
cast                     0.70
homepage                72.98
director                 0.40
tagline                 25.99
keywords                13.74
overview                 0.04
runtime                  0.00
genres                   0.21
production_companies     9.48
release_date             0.00
vote_count               0.00
vote_average             0.00
release_year             0.00
budget_adj               0.00
revenue_adj              0.00
dtype: float64

Note:

  • Looking at the result above, we clearly can see the homepage column has approximately 73% null values which is a huge amount, wherefore the homepage column will be dropped since its not necessary (aren't relevant to our questions).
  • Likewise the tagline and the overview columns, the tagline has approximately 26% null values and the overview has 0.04% null values which is very tiny amount but since the column is not relavent it will also be droped.
  • The third largest column with null values is the keywords column.
  • We can see that 9.48% of production_companies column is empty so to solve this issue we will fillin these null values with "Unkown" keyword, as well as the director and the cast columns.
  • About the imdb_id column there's 0.09% of its values is null, and after observing the column values which always start with tt letters attached to 7 random generated numbers, for instance (tt0369610), therefore I figured out that it can be filled with the same method that the imdb website generates for every movie.
  • Finally I decided to drop the null records in the keywords, and genres columns.
In [7]:
df_url.describe(exclude=[object]) # descriptive statistics excluding object columns
Out[7]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09
In [8]:
df_url.describe(exclude=[np.number]) # descriptive statistics excluding numeric columns 
Out[8]:
imdb_id original_title cast homepage director tagline keywords overview genres production_companies release_date
count 10856 10866 10790 2936 10822 8042 9373 10862 10843 9836 10866
unique 10855 10571 10719 2896 5067 7997 8804 10847 2039 7445 5909
top tt0411951 Hamlet Louis C.K. http://www.thehungergames.movie/ Woody Allen Based on a true story. woman director No overview found. Comedy Paramount Pictures 1/1/09
freq 2 4 6 4 45 5 134 13 712 156 28
In [9]:
df_url[df_url.duplicated()] # find the duplicate rows, which will be removed
Out[9]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
2090 42194 tt0411951 0.59643 30000000 967000 TEKKEN Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian... NaN Dwight H. Little Survival is no game ... In the year of 2039, after World Wars destroy ... 92 Crime|Drama|Action|Thriller|Science Fiction Namco|Light Song Films 3/20/10 110 5.0 2010 30000000.0 967000.0

1 rows × 21 columns

Note:
We observe that there is one duplicate row in our dataset, which will be removed.

In [10]:
df_url["production_companies"].value_counts() # counts of unique values
Out[10]:
Paramount Pictures                       156
Universal Pictures                       133
Warner Bros.                              84
Walt Disney Pictures                      76
Columbia Pictures                         72
                                        ... 
Silverwood Films                           1
Lunch Box Entertainment                    1
Atitude Produções e Empreendimentos      1
Mulmur Feed Company                        1
Norm-Iris                                  1
Name: production_companies, Length: 7445, dtype: int64
In [11]:
df_url['vote_average'].unique() # print unique values
Out[11]:
array([6.5, 7.1, 6.3, 7.5, 7.3, 7.2, 5.8, 7.6, 8. , 6.2, 5.2, 7.4, 6.1,
       7. , 6.8, 5.3, 7.8, 6.4, 6.6, 7.7, 5.6, 6.9, 5.9, 6.7, 5.5, 5. ,
       4.4, 5.4, 5.1, 4.8, 5.7, 4.1, 3.9, 4.5, 6. , 4.2, 3.6, 4.3, 4.9,
       4.7, 4. , 3.5, 3.8, 3.3, 3.7, 4.6, 7.9, 8.2, 2.6, 3.1, 8.9, 3.2,
       2.4, 8.4, 3. , 2.8, 3.4, 8.8, 8.1, 8.3, 2.7, 2.5, 2.1, 8.6, 2.9,
       8.5, 9.2, 2.2, 2. , 8.7, 2.3, 1.5])
In [12]:
#    1st step :  Drop unwanted cloumns / aren't relevant to our questions
df_url = df_url.drop(['homepage', 'tagline', 'overview'],axis = 1)
#print(df_url)
In [13]:
#    2nd step :  Replaced null values in production_companies, cast, director with Unknown keywords
df_url = df_url.fillna({'production_companies':'Unknown', 'cast': 'Unknown', 'director': 'Unknown'})
#print(df_url)
In [14]:
#    3rd step :  Drop duplicate row in the dataset
df_url.drop_duplicates(inplace=True)
df_url.reset_index(drop=True, inplace=True)
df_url.shape
#print(df_url)
Out[14]:
(10865, 18)
In [15]:
df_url.isnull().sum() #check if the three steps above have worked
Out[15]:
id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                       0
director                   0
keywords                1493
runtime                    0
genres                    23
production_companies       0
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64
In [16]:
#    4th step :  Fill in imdb_id null records with random numbers generator
import random
import string
import numpy as np

def imdb_id_generator(): # function that returns 7 random numbers with "tt" before it 
    randomString = ''.join(random.choice(string.digits) for i in range(7)) #Determine the Length of the random string as 7
    imdb_id_value = 'tt' + randomString # add "tt" before the numbers
    print(imdb_id_value ) # print the values 
    return imdb_id_value # return the values 

# Go through imdb_id column and check if a record is null, if so fill it with the new generated id number
df_url['imdb_id'] = df_url.apply(lambda v: imdb_id_generator() if pd.isnull(v['imdb_id']) else v['imdb_id'], axis=1) 
tt3966309
tt1493971
tt5252547
tt6229659
tt1262669
tt9481106
tt3889494
tt6544259
tt6320783
tt7031313
In [17]:
duplicate = df_url[df_url.duplicated('imdb_id')] # Finally we will check if there's any duplicate records after generating the numbers above

duplicate # print the records
Out[17]:
id imdb_id popularity budget revenue original_title cast director keywords runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj

Note:
We can see there's no duplicated records.

In [18]:
#    5th step :  Drop null records in keywords, genres columns

df_url = df_url.dropna(subset = ['keywords', 'genres']) # Drop all rows with NaNs in keywords and genres
In [19]:
df_url.isnull().sum()
Out[19]:
id                      0
imdb_id                 0
popularity              0
budget                  0
revenue                 0
original_title          0
cast                    0
director                0
keywords                0
runtime                 0
genres                  0
production_companies    0
release_date            0
vote_count              0
vote_average            0
release_year            0
budget_adj              0
revenue_adj             0
dtype: int64

Note:
As the output shown, we get rid of / replaced all the missing values.
Now the data is clean and ready to use.

In [20]:
#    6th step - part(1) :  change relase_date type to datetime to be useful
df_url['release_date'] = pd.to_datetime(df_url['release_date'])
df_url['release_date'].dtype

#type(df_url['release_date'])
Out[20]:
dtype('<M8[ns]')
In [21]:
#    6th step - part(2) :  Generate (month,day) columns from release_date column
df_url['release_day'] = (df_url['release_date']).dt.day
df_url['release_month'] = (df_url['release_date']).dt.month

df_url.head(3)
Out[21]:
id imdb_id popularity budget revenue original_title cast director keywords runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj release_day release_month
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow monster|dna|tyrannosaurus rex|velociraptor|island 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 2015-06-09 5562 6.5 2015 1.379999e+08 1.392446e+09 9 6
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... George Miller future|chase|post-apocalyptic|dystopia|australia 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 2015-05-13 6185 7.1 2015 1.379999e+08 3.481613e+08 13 5
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... Robert Schwentke based on novel|revolution|dystopia|sequel|dyst... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 2015-03-18 2480 6.3 2015 1.012000e+08 2.716190e+08 18 3

Exploratory Data Analysis¶

Question (1): What is the Financial Performance of Movies Over Time¶

In [22]:
#  - - - - Question(1) - 1st step: Create two dataframes for movies financial performance both grouped by years:
budget_year = df_url.groupby('release_year')['budget'].sum().reset_index(name = "budget")
revenue_year = df_url.groupby('release_year')['revenue'].sum().reset_index(name = "revenue")

# Change "budget_year" and "revenue_year" from object to dataframes type:
budget_year =  pd.DataFrame(budget_year) 
revenue_year =  pd.DataFrame(revenue_year)

# Merge the twondataframes to one:
financial_performance_df = budget_year.merge(revenue_year) 
financial_performance_df # print the new dataframe
Out[22]:
release_year budget revenue
0 1960 19056948 137905000
1 1961 46137000 337720188
2 1962 54722126 215579846
3 1963 73265000 187404989
4 1964 39483161 340981782
5 1965 70205115 458081854
6 1966 57554800 84736689
7 1967 100652200 737834637
8 1968 71939000 264732980
9 1969 41954087 243957076
10 1970 126961946 560221969
11 1971 75997000 404910610
12 1972 36279254 494730171
13 1973 65190783 1223981102
14 1974 76970000 812539818
15 1975 56279000 957489966
16 1976 122150000 801005600
17 1977 161580000 2180583159
18 1978 208997000 1353386648
19 1979 254814000 1684794913
20 1980 362500000 1768662387
21 1981 368612000 1754700877
22 1982 437795002 2444443852
23 1983 519107412 2297358384
24 1984 711211964 2537395361
25 1985 717520637 2825489230
26 1986 684033613 2956898540
27 1987 707655716 3386522116
28 1988 898923000 3725222566
29 1989 1067656360 5128933486
30 1990 1289922066 5298128773
31 1991 1453433000 4684739920
32 1992 1425325538 5999868833
33 1993 1748628653 6949429765
34 1994 2185807032 7086791105
35 1995 2819384377 9152179144
36 1996 3600042051 8278301474
37 1997 4681086675 10606757132
38 1998 4473560000 9492543717
39 1999 5624775106 11252268548
40 2000 5517200000 10724897412
41 2001 5562700000 13359237296
42 2002 5635640255 14102793054
43 2003 6114488167 15062817728
44 2004 7064140208 16704911389
45 2005 7194212349 16440981223
46 2006 7146782800 16156482955
47 2007 7289319004 19057469082
48 2008 7424116809 18637699537
49 2009 8223334056 22028168149
50 2010 8716734449 21359123379
51 2011 8717847976 23334587180
52 2012 8010784010 24384579732
53 2013 8739303605 24032354172
54 2014 7732194045 24024972183
55 2015 7445797557 26635501691
In [23]:
#  - - - - Question(1) - 2nd step: Create a plot for movies financial performance grouped by years:

palette1 = ["#8BC1C2","#91B9A3"] # colour palette(1)

fig_budget_revenue = px.area(financial_performance_df, x='release_year', y=['budget','revenue'], color_discrete_sequence = palette1,
          title="The Financial Performance of Movies Over Time", labels={'release_year':'year', 'value':'Thousands of millions ($)', 'variable':'Financial Performance'}  )\
    .update_traces(dict(marker_line_width=1, marker_line_color="black"))

fig_budget_revenue.update_layout({
    'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})


fig_budget_revenue.show()

Insight:¶

Clearly, we can see that over the years both the movie industry expenses and the revenues have been increasing exponentially, reaching the highest point in 2015 with $26 billion USD. which indicates the increase of the investors' interest in the industry due to its high profitability.


Question(2): What is the Financial Performance (Budget and Profit) of the Top 10 Most Rating Movies¶

In [24]:
#  - - - - Question(2) - 1st step: Calculate the profit value for each movie and determine if it's profitable or not

df_url['profit_value'] = df_url['revenue'] - df_url['budget']

df_url['profitable'] = np.where(df_url['revenue'] > df_url['budget'], 1, 0) #determine whether the movie is (profitable = 1) or (not = 0)


df_url.head(2)
Out[24]:
id imdb_id popularity budget revenue original_title cast director keywords runtime ... release_date vote_count vote_average release_year budget_adj revenue_adj release_day release_month profit_value profitable
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow monster|dna|tyrannosaurus rex|velociraptor|island 124 ... 2015-06-09 5562 6.5 2015 1.379999e+08 1.392446e+09 9 6 1363528810 1
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... George Miller future|chase|post-apocalyptic|dystopia|australia 120 ... 2015-05-13 6185 7.1 2015 1.379999e+08 3.481613e+08 13 5 228436354 1

2 rows × 22 columns

In [25]:
#  - - - - Question(2) - 2nd step: Select the top 10 movies by most rating to analys thier financial performence 

top_rating_movies = df_url.sort_values('vote_count',ascending=False).head(10)
top_rating_movies
Out[25]:
id imdb_id popularity budget revenue original_title cast director keywords runtime ... release_date vote_count vote_average release_year budget_adj revenue_adj release_day release_month profit_value profitable
1919 27205 tt1375666 9.363643 160000000 825500000 Inception Leonardo DiCaprio|Joseph Gordon-Levitt|Ellen P... Christopher Nolan loss of lover|dream|sleep|subconsciousness|heist 148 ... 2010-07-14 9767 7.9 2010 1.600000e+08 8.255000e+08 14 7 665500000 1
4360 24428 tt0848228 7.637767 220000000 1519557910 The Avengers Robert Downey Jr.|Chris Evans|Mark Ruffalo|Chr... Joss Whedon new york|shield|marvel comic|comic|superhero 143 ... 2012-04-25 8903 7.3 2012 2.089437e+08 1.443191e+09 25 4 1299557910 1
1386 19995 tt0499549 9.432768 237000000 2781505847 Avatar Sam Worthington|Zoe Saldana|Sigourney Weaver|S... James Cameron culture clash|future|space war|space colony|so... 162 ... 2009-12-10 8458 7.1 2009 2.408869e+08 2.827124e+09 10 12 2544505847 1
2874 155 tt0468569 8.466668 185000000 1001921825 The Dark Knight Christian Bale|Michael Caine|Heath Ledger|Aaro... Christopher Nolan dc comics|crime fighter|secret identity|scarec... 152 ... 2008-07-16 8432 8.1 2008 1.873655e+08 1.014733e+09 16 7 816921825 1
4363 68718 tt1853728 5.944518 100000000 425368238 Django Unchained Jamie Foxx|Christoph Waltz|Leonardo DiCaprio|K... Quentin Tarantino bounty hunter|hero|plantation|society|friendship 165 ... 2012-12-25 7375 7.7 2012 9.497443e+07 4.039911e+08 25 12 325368238 1
4381 70160 tt1392170 2.571099 75000000 691210692 The Hunger Games Jennifer Lawrence|Josh Hutcherson|Liam Hemswor... Gary Ross hallucination|dystopia|female protagonist|bow ... 142 ... 2012-03-12 7080 6.7 2012 7.123082e+07 6.564734e+08 12 3 616210692 1
5424 68721 tt1300854 4.946136 200000000 1215439994 Iron Man 3 Robert Downey Jr.|Gwyneth Paltrow|Guy Pearce|D... Shane Black terrorist|war on terror|tennessee|malibu|marve... 130 ... 2013-04-18 6882 6.9 2013 1.872067e+08 1.137692e+09 18 4 1015439994 1
4362 49026 tt1345836 6.591277 250000000 1081041287 The Dark Knight Rises Christian Bale|Michael Caine|Gary Oldman|Anne ... Christopher Nolan dc comics|crime fighter|terrorist|secret ident... 165 ... 2012-07-16 6723 7.5 2012 2.374361e+08 1.026713e+09 16 7 831041287 1
629 157336 tt0816692 24.949134 165000000 621752480 Interstellar Matthew McConaughey|Jessica Chastain|Anne Hath... Christopher Nolan saving the world|artificial intelligence|fathe... 169 ... 2014-11-05 6498 8.0 2014 1.519800e+08 5.726906e+08 5 11 456752480 1
4366 49051 tt0903624 4.218933 250000000 1017003568 The Hobbit: An Unexpected Journey Ian McKellen|Martin Freeman|Richard Armitage|A... Peter Jackson riddle|elves|dwarves|orcs|middle-earth (tolkien) 169 ... 2012-11-26 6417 6.9 2012 2.374361e+08 9.658933e+08 26 11 767003568 1

10 rows × 22 columns

In [26]:
#  - - - - Question(2) - 3nd step: financial performence of the top 10 most rating movie


top_rating_financial_fig = px.bar(top_rating_movies, x=['budget','profit_value',], y='original_title',
            labels={'original_title':'Movie', 'budget': 'Budget', 'profit_value':'Profit', 'variable':'Financial Performance', 'value':'Thousands of millions ($)' },
            text_auto='.3s',
            color_discrete_sequence=palette1 ,
            title="Financial Performence of The Top 10 Most Rating Movies")

top_rating_financial_fig.update_layout({
    'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})

top_rating_financial_fig.show()

Insight:¶

Avatar was the most commercially successful movie of all time, earning 2.54 billion U.S. dollars at the box office. It is evident that the most successful movies of all time with the highest vote are the ones with the greatest profitability.


Question(3): Which Movies Genres Companies Mostly Invest In?¶

In [27]:
df_url['genres'].value_counts() # print unique values of genres
Out[27]:
Drama                                     621
Comedy                                    584
Drama|Romance                             264
Comedy|Drama                              253
Documentary                               243
                                         ... 
Documentary|Drama|Music|Thriller            1
Comedy|Horror|Thriller|Mystery|Fantasy      1
Comedy|Family|Mystery|TV Movie              1
Mystery|Drama|Adventure                     1
Mystery|Comedy                              1
Name: genres, Length: 1870, dtype: int64
In [28]:
#  - - - - Question(3) - 1st step: Sum of budgets for each genres (Grouped by)

budget_genres = df_url.groupby('genres')['budget'].sum().reset_index(name = "budget_genres")
budget_genres = budget_genres.sort_values('budget_genres',ascending=False).head(10) # get the 10 highest budgets of gebres
budget_genres
Out[28]:
genres budget_genres
640 Comedy 6547735670
915 Drama 5621027156
788 Comedy|Romance 3269566906
1115 Drama|Romance 2611808060
715 Comedy|Drama|Romance 2126509807
682 Comedy|Drama 1965552010
304 Action|Thriller 1961700000
601 Animation|Family 1822500000
483 Adventure|Fantasy|Action 1776900000
104 Action|Adventure|Science Fiction 1594500001
In [29]:
#  - - - - Question(2) - 2nd step: bar plot


palette2 = ["#F4B592","#D2BAC8","#E5BDB2","#D6C0B6","#CBDBCE","#F5A7A0","#8BC1C2","#91B9A3"]


top_budget_genres_fig = px.bar(budget_genres, x='budget_genres', y='genres',
            labels={'genres':'Genres', 'budget_genres': 'Budget', 'value':'Thousands of millions ($)'},
            text_auto='.3s',
            color_discrete_sequence = palette2 ,
            title="Movies Genres with highest investments")

top_budget_genres_fig.update_layout({
    'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})

top_budget_genres_fig.show()

Insight:¶

The plot above revealed that comedy was the highest investments genre with a budget of 6.5 billion U.S. dollars, though Drama, Comedy|Romance, and Drama|Romance were similarly high in investments.


Question(4): Which Movies Genres are the Most Protofable?¶

In [30]:
#  - - - - Question(4) - 1st step: Sum of profits for each genres (Grouped by)

profit_genres = df_url.groupby('genres')['profit_value'].sum().reset_index(name = "profit_genres")
profit_genres = profit_genres.sort_values('profit_genres',ascending=False).head(10) # get the 10 highest budgets of gebres
profit_genres
Out[30]:
genres profit_genres
640 Comedy 13237560812
915 Drama 9302298294
788 Comedy|Romance 8061551095
483 Adventure|Fantasy|Action 5820583556
601 Animation|Family 4874729242
104 Action|Adventure|Science Fiction 4825984307
515 Adventure|Fantasy|Family 4799516484
715 Comedy|Drama|Romance 4708543340
364 Adventure|Action|Science Fiction 4362926988
1115 Drama|Romance 4165058837
In [31]:
#  - - - - Question(2) - 2nd step: bar plot


palette3 = ["#51C3C5","#91B9A3","#51C3C5","#F1B2A0","#8B575C","#C98986","#F6BDD1","#F6E4F6","#857991"]

top_profit_genres_fig = px.bar(profit_genres, x='profit_genres', y='genres',
            labels={'genres':'Genres', 'profit_genres': 'Profit', 'value':'Thousands of millions ($)'},
            text_auto='.3s',
            color_discrete_sequence = palette3 ,
            title="Movies Genres with highest Profits")

top_profit_genres_fig.update_layout({
    'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})

top_profit_genres_fig.show()

Insight:¶

Between 1960 and 2015, Comedy was the highest profitable movie genre at the box office. Comedy films recorded a box office revenue of almost 13.2 billion U.S. dollars within that period. Drama movies followed with more than 9 billion dollars in revenue.

In [32]:
profit_genres1 = df_url.groupby('genres')['profit_value'].sum().reset_index(name = "profit_genres")
profit_genres1 = profit_genres.sort_values('profit_genres',ascending=True).head(10) # get the 10 highest budgets of gebres
profit_genres1
Out[32]:
genres profit_genres
1115 Drama|Romance 4165058837
364 Adventure|Action|Science Fiction 4362926988
715 Comedy|Drama|Romance 4708543340
515 Adventure|Fantasy|Family 4799516484
104 Action|Adventure|Science Fiction 4825984307
601 Animation|Family 4874729242
483 Adventure|Fantasy|Action 5820583556
788 Comedy|Romance 8061551095
915 Drama 9302298294
640 Comedy 13237560812

Question(5): How Can the Vote Count Rffect the Profits of a Movie?¶

In [33]:
import seaborn as sns

pallete1 = ["#F5A7A0","#91B9A3"]

fig, ax = plt.subplots(figsize=(16, 8))
voting_profit_fig =sns.scatterplot(x="revenue", y="vote_count",
              hue="profitable", palette=pallete1, legend='full',
              data=df_url);

Insight:¶

It can be seen from the graph that movies with a higher revenue also have a higher vote count. 1 means it's protifable and 0 mean its not.

Conclusions¶

Although investment in the movie industry can actually be a complex and risky endeavor, it is one of the most reliable industry in getting profits, And according to this experimental investigation, we find that it’s definitely a hugely profitable business.

In this report, we investigated the relationship between box office revenue(financial statement) and different variables, including vote counting, and genre, and how it can influence the financial success of a film, but there might be different major factors including casts, date of release, keywords, and popularity have statistically significant in generating box office revenue.

Among the factors that determine movie attendance, genre was found to be the most significant. we examined after analys our data that choosing a popular genre would be positively correlated with high box office revenue. The most profitable movies consists of 10 most popular genres, including Comedy, Drama, Comedy|Romance, Adventure|Fantasy|Action, Animation|Family, Action|Adventure|Science Fiction, Comedy|Drama|Romance, Adventure|Action|Science Fiction, and Drama|Romance, while the remaining genres such as Fantasy, Family, Mystery, Documentary, and Horror are classified as ‘unpopular’.

Also looking to the genres of the most profitable movies we can observe that there are different genre classifications assigned to the same movie in an attempt to attract a broad audience in order to maximize box office revenue.

Additionally, most of the high vote-count movies were found to experience significant increases in revenue, which confirms our hypothesis. Therefore, we can conclude that the vote count of a movie does, in fact, have a significant impact on domestic box office revenue.

Finally, A profit value was calculated as a better indicator of a movie's financial success

Limitations:¶

One of the limitations that I challenged, There was some difficulty in obtaining data about revenue and budget costs of movies that were associated with the high vote-average, when trying to analyze the correlation between the vote-average and the box office revenues, which has to be ignored in this study, despite the fact that voting average plays a statistically significant role in increasing box office revenue.

However, the alternative approach was to use the vote-count to determine the financial success of a movie.

In [ ]: